SYSTEM AND METHOD FOR 
IMPLEMENTING ERROR DETECTION AND 
RECOVERY IN A SYSTEM AREA NETWORK 

CROSS-REFERENCES TO RELATED 
APPLICATIONS 

This application claims priority from Provisional App. 
No. 60/070,650, filed Jan. 7, 1998, which is incorporated 
herein by reference. 

BACKGROUND OF THE INVENTION 

Traditional network systems utilize either a channel semantics
(send/receive) model or a memory semantics (DMA) model. 
Channel semantics tend to be used in I/O environments and 
memory semantics tend to be used in processor environ- 
ments. 

In a channel semantics model, the sender does not know 
where data is to be stored; it just puts the data on the channel. 
On the sending side, the sending process specifies the 
memory regions that contain the data to be sent. On the 
receiving side, the receiving process specifies the memory 
regions where the data will be stored. 

In the memory semantics model, the sender directs the 
data to a particular location in the memory, utilizing remote 
direct memory access (RDMA) transactions. The initiator of 
the data transfer specifies both the source buffer and desti- 
nation buffer of the data transfer. There are two types of 
RDMA operations, read and write. 

The virtual interface architecture (VIA) has been jointly 
developed by a number of computer and software compa- 
nies. VIA provides consumer processes with a protected, 
directly accessible interface to network hardware, termed a 
virtual interface. VIA is especially designed to provide low 
latency message communication over a system area network 
(SAN) to facilitate multi-processing utilizing clusters of 
processors. 

A SAN is used to interconnect nodes within a distributed 
computer system, such as a cluster. The SAN is a type of 
network that provides high bandwidth, low latency commu- 
nication with a very low error rate. SANs often utilize 
fault-tolerant capability. 

It is important for the SAN to provide high reliability and 
high-bandwidth, low latency communication to fulfill the 
goals of the VIA. Further, it is important for the SAN to be 
able to recover from errors and continue to operate in the 
event of equipment failures. Error recovery must be accom- 
plished without high CPU overhead associated with all 
transactions. Furthermore, error recovery should not 
increase the complexity for the consumer of VIA services. 

SUMMARY OF THE INVENTION 

According to one aspect of the present invention, a SAN 
maintains local copies of a sequence number for each data 
transfer transaction at the requester and responder nodes. 
Each data transfer is implemented by the SAN as a sequence 
of request/response packet pairs. An error condition arises if 
a response to any request packet is not received at the 
requesting node. The responder and requestor nodes are 
coupled by a plurality of paths and each node maintains a 
record of the good or bad status of each path. If a transaction 
fails and the path is permanently bad, both nodes update 
their status to indicate that the path is bad. This is to prevent 
further transactions from including any stale requests that 
are potentially still in the network from arriving at the 
destination and potentially corrupting data. 

According to another aspect of the invention, if an error 
occurs on a path the requestor node implements a barrier 
transaction on the path to determine if the failure is perma- 
nent or transient. 

According to another aspect of the invention, the barrier 
transaction is performed by writing a number chosen from a 
large number space in a way that minimizes the probability 
of reusing the number in a short period of time. 

According to one aspect of the invention, the number is 
randomly chosen from a large number space. 

According to another aspect of the invention, the large 
number is based on the requestor ID and an incrementing 
component managed by the requestor. 
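
As an illustration only, the requestor-ID-plus-incrementing-component scheme can be sketched as follows. The 64-bit layout, the field widths, and the names `BarrierNumberSource` and `next_value` are assumptions made for this sketch and are not taken from the specification.

```python
import itertools

# Assumed widths for a hypothetical 64-bit barrier value.
REQUESTOR_ID_BITS = 20
COUNTER_BITS = 44


class BarrierNumberSource:
    """Produces barrier values that are unlikely to repeat in a short period."""

    def __init__(self, requestor_id: int) -> None:
        if not 0 <= requestor_id < (1 << REQUESTOR_ID_BITS):
            raise ValueError("requestor_id out of range")
        self._requestor_id = requestor_id
        self._counter = itertools.count()

    def next_value(self) -> int:
        # The incrementing component is managed locally by the requestor.
        count = next(self._counter) % (1 << COUNTER_BITS)
        # Placing the requestor ID in the high bits keeps values from
        # different requestors distinct.
        return (self._requestor_id << COUNTER_BITS) | count
```

With this layout a single requestor would have to issue 2**44 barrier values before one repeats, and two requestors never produce the same value.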

According to another aspect of the invention, if the failure 
is transient the requestor retransmits packets starting with 
the packet that first caused an error condition to be detected. 

According to another aspect of the invention, a sequence 
number is included in each request packet and copied into 
each response packet. A local copy of the sequence number 
is maintained at the requestor and responder nodes. If the 
sequence number in the request packet does not match the 
sequence number at the responder, a negative acknowledge 
response packet is generated. 

Other features and advantages of the invention will be 
apparent in view of the following detailed description and 
appended drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram depicting ServerNet protocol 
layers implemented by hardware, where ServerNet is a SAN 
manufactured by the assignee of the present invention; 

FIGS. 2 and 3 are block diagrams depicting SAN topologies; 

FIG. 4 is a schematic diagram depicting logical paths 
between end nodes of a SAN; 

FIG. 5 is a schematic diagram depicting routers and links 
connecting SAN end nodes; 

FIG. 6 is a graph depicting the transmission of request and 
response packets between a requestor and a responder end 
node. FIG. 6 shows the sequence numbers used in packets 
for three Send operations, an RDMA operation, and two 
additional Send operations. The diagram shows the 
sequence numbers maintained in the requestor logic, the 
sequence number contained in each packet, and the sequence 
numbers maintained at the responder logic; 

FIG. 7 is two interlocked state diagrams showing the state 
that software on the requestor and responder moves through 
for each path; 

FIG. 8 is a graph depicting retransmission during error 
recovery due to a lost request packet; and 

FIG. 9 is a graph depicting retransmission during error 
recovery due to a lost acknowledgment packet. 

DESCRIPTION OF THE EMBODIMENTS 

The preferred embodiments will be described as implemented 
in the ServerNet II (ServerNet) architecture, manufactured 
by the assignee of the present invention, which is a 
layered transport protocol for a System Area Network 
(SAN) optimized to support the Virtual Interface (VI) architecture 
session layer, which has stringent user-space to 
user-space latency and bandwidth requirements. These 
requirements mandate a reliable hardware (HW) message 
transport solution with minimal software (SW) protocol 
stack overhead. The ServerNet II protocol layers for an end 
node VI Network Interface Controller/Card (NIC) and for a 
routing node are illustrated in FIG. 1. A single NIC and VI 
session layer may support one or two ports, each with its 
associated transaction, packet, link-level, MAC (media 
access) and physical layer. Similarly, routing nodes with a 
common routing layer may support multiple ports, each with 
its associated link-level, MAC and physical layer. 

Support for two ports enables the ServerNet II SAN to be 
configured in both non-redundant and redundant (fault 
tolerant, or FT) SAN configurations as illustrated in FIG. 2 
and FIG. 3. On a fault tolerant network, a port of each end 
node may be connected to each network to provide continued 
VI message communication in the event of failure of one 
of the SANs. In the fault tolerant SAN, nodes may be ported 
into a single fabric or single ported end nodes may be 
grouped into pairs to provide duplex FT controllers. The 
fabric is the collection of routers, switches, connectors, and 
cables that connect the nodes in a network. 

The following describes general ServerNet II terminology 
and concepts. The use of the term 'layer' in the following 
description is intended to describe functionality and does not 
[illegible] 

Two ports are supported on a NIC for both performance 
and fault tolerance reasons. Both of these ports operate 
under the same session layer VI engine. That is, data may 
arrive on any port and be destined for any VI. Similarly, the 
VIs on the end node can generate data for any of these ports. 

ServerNet II packets are comprised of a series of data 
symbols followed by a packet framing command. Other 
commands, used for flow control, virtual channel support, 
and other link management functions, may be embedded 
within a packet. Each request or response packet defines a 
variety of information for routing, transaction type, 
verification, length and VI specific information. 

i. Routing in the ServerNet II SAN is destination-based 
using the first 3 bytes of the packet. Each NIC end node 
port in the network is uniquely defined by a 20 bit Port 
SNID (ServerNet Node ID). The first 3 bytes of a 
packet contain the Destination port's SNID or DID 
(destination port ID) field, a three bit Adaptive Control 
Bits (ACB) field and the fabric ID bit. The ACB is used 
to specify the path (deterministic or link-set adaptive) 
used to route the packet to its destination port as 
described in the following section. 

ii. The transaction type fields define the type of session 
layer operation that this ServerNet II packet is carrying 
and other information such as whether it is a request or 
a response and, if a response, whether it is an ACK 
(acknowledgment) or a NACK (negative 
acknowledgment). The ServerNet II SAN also supports 
other transaction types. 

iii. Transaction verification fields include the source port 
ID (SID) and a Transaction Serial Number. The 
transaction serial number enables a port with multiple 
requests outstanding to uniquely match responses to 
requests. 

iv. The Length field consists of an encoding of the number 
of bytes of payload data in the packet. Payloads up to 
512 bytes are supported and code space is reserved for 
future increases in payload size. 

v. The VI Session Layer specific fields describe VI 
information such as the VI Operation, the VIA 
Sequence number and the Virtual Interface ID number. 
The VI Operation field defines the type of VI transaction 
being sent (Send, RDMA Read, RDMA Write) and 
other control information such as whether the packet is 
ordered or unordered, whether there is immediate data 
and/or whether this is the first or last packet in a session 
layer multi-packet transfer. Based on the VI transaction 
type and control information, a 32 bit Immediate data 
field or a 64 bit Virtual address may follow the VI ID 
number. 

vi. The payload data field carries up to 512 bytes of data 
between requesters and responders and may contain a 
pad byte. 

vii. The CRC field contains a checksum computed over 
the entire packet. 

Transaction Overview 

The flow of transactions through the ServerNet II 
SAN [illegible] Send, RDMA read and RDMA write 
transactions. These are translated by the session layer into 
a set of ServerNet II transactions (request/response packet 
pairs). All data transfers (e.g., reading [illegible] 
volumes of data [illegible] one end node [illegible] 
interrupting another) consist of one or more such transactions. 

Creating a Request Packet 

[illegible] routines for VI Send, RDMA Write, and RDMA 
Read operations. [illegible] the appropriate VI queue and 
notify the hardware that [illegible] 

Dual Ports and Ordering 

In a NIC with two ports, it is possible for a [illegible] 
interface to process Sends and RDMA operations from 
several different VIs in parallel. It is also possible for a large 
RDMA transfer from a single VI to be transferred on both of 
the ports simultaneously. This latter feature is called 
Multipathing. 

ServerNet II end nodes can connect both their ports to a 
single network fabric so that there are up to four possible 
paths between ServerNet II end nodes. Each port of a single 
end node may have a unique ServerNet ID (SNID). FIG. 4 
depicts the four possible paths that End node A can use when 
sending requests to End node B: 

1) End node A SNID[0] to End node B SNID[0] 
2) End node A SNID[0] to End node B SNID[1] 
3) End node A SNID[1] to End node B SNID[0] 
4) End node A SNID[1] to End node B SNID[1] 

FIG. 5 depicts a network topology utilizing routers and 
links. In FIG. 5, end nodes A-F, each having first and second 
send/receive ports 0 and 1, are coupled by a ServerNet 
topology including routers R1-R5. Links are represented by 
lines coupling ports to routers or routers to routers. A first 
adaptive set (fat pipe) 2 couples routers R1 and R3 and a 
second adaptive set (fat pipe) 4 couples routers R2 and R4. 

Routing may be deterministic or link set adaptive. An 
adaptive link-set is a set of links (also called lanes) between 
two routers that have been grouped to provide higher 
bandwidth. The Adaptive Control Bits (ACB) specify which 
type of routing is in effect for a particular packet. 

Deterministic routing preserves strict ordering for packets 
sent from a particular source port to a destination port. In 
deterministic routing, the ACB field selects a single path or 
lane through an adaptive link-set. Send transactions for a 
particular VI require strict ordering and therefore use 
deterministic routing. 

RDMA transactions, on the other hand, may make use of 
all possible paths in the network without regard for the 
ordering of packets within the transaction. These transactions 
may use link-set adaptive routing as described below. 
The ACB field specifies which specific link (or lane) in this 
link-set is to be used for deterministic routing. 

Alternatively the ACB field can specify link-set adaptivity 
which enables the packets to dynamically choose from 
any of the links in the link-set. 

A sample topology with several different examples of 
multipathing using link and path adaptivity is shown in FIG. 
[illegible] 

Multipathing allows large block transfers done with 
RDMA Read or Write operations to simultaneously use both 
[illegible] characteristics of any 
one VI are expected to be bursty, multipathing allows the 
end node to marshal all its resources for a single transfer. 
Note that multipathing does not increase the throughput of 
multiple Send operations from one VI. Sends from one VI 
must be sent strictly ordered. Since there are no ordering 
guarantees between packets originating from different ports 
on a NIC, only one port may be used per Send. Furthermore, 
only a single ordered path through the Network may be used, 
as described in the following. 

Transaction and Packet Layers 

The transaction layer builds the ServerNet II request 
packet by filling in the appropriate SID, Transaction Serial 
Number (TSN), and CRC. The SID assigned to a packet 
always corresponds to the SNID of the port the packet 
originates from. The TSN can be used to help the port 
manage multiple outstanding requests and match the resulting 
responses uniquely to the appropriate request. The CRC 
enables the data integrity of the packet to be checked at the 
end node and by routers en route. 

Following the ServerNet II link protocol, the packet is 
encoded in a series of data symbols followed by a status 
command. The ServerNet II link layer uses other commands 
for flow control and link management. These commands 
may be inserted anywhere in the link data stream, including 
between consecutive data symbols of a packet. Finally, the 
symbols are passed through the MAC layer for transmission 
on the physical media to an intermediate routing node. 

Routing 

The routing control function is programmable so that the 
packet routing can be changed as needed when the network 
configuration changes (e.g., route to new end nodes). Router 
nodes serve as crossbar switches; a packet on any incoming 
(receive) side of a link can be switched to the outgoing 
(transmit) side of any link. As the incoming request packet 
arrives at a router node, the first three bytes, containing the 
DID and ACB fields, are decoded and used to select a link 
leading to the destination node. If the transmit side of the 
selected link is not busy, the head of the packet is sent to the 
destination node whether or not the tail of the packet has 
arrived at the routing node. If the selected link is busy with 
another packet, the newly arrived packet must wait for the 
target port to become free before it can pass through the 
crossbar. 

As the tail of the packet arrives, the router node checks the 
packet CRC and updates the packet status (good or bad). The 
packet status is carried by a link symbol TPG (this packet 
good) or TPB (this packet bad) appended at the end of the 
packet. Since packet status is checked on each link, a packet 
status transition (good to bad) can be attributed to a specific 
link. The packet routing process described above is repeated 
for each router node in the selected path to the destination 
node. 

Receiving a Request Packet 

[illegible] packet [illegible] at the destination 
[illegible] receiver checks its validity (e.g. 
[illegible] destination node ID, the length is 
[illegible] Fabric [illegible] packet matches the 
[illegible] receiving port, the request field encodes 
a valid request, and CRC [illegible]). If the packet is 
invalid for any reason, the packet is discarded. The ServerNet II 
interface may save error status for evaluation by 
software. If these validity checks succeed, more 
checks are made. Specifically, if the request specifies an 
RDMA Read or [illegible] checked to ensure 
[illegible] access has been enabled for that particular VI. Also, the 
input port and Source ID of the packet are checked to ensure 
access to the particular VI is allowed on that input port from 
the particular Source. If the packet is valid, the request can 
be [illegible] 

Response Packet 

A response is created based on the success (ACK 
response) or failure (NACK response) of the request packet. 
A successful read request, for example, would include the 
read data in the ACK response. The source node ID from the 
request packet is used as the destination node ID for the 
response packet. The response packet must be returned to 
the original source port. The path taken by the response is 
not necessarily the reverse of the path taken by the request. 
The network may be configured so that responses take very 
different paths than requests. If strict ordering is not 
required, the response, like the request, may use link-set 
adaptivity. The response packet is routed back to the SNID 
[illegible]. The ACB field of 
the [illegible] packet is also duplicated for the response packet. 

The response can be matched with the request using the 
[illegible] packet validity checks. If an ACK response 
passes these tests, the transaction layer passes the response 
data to the session layer, frees resources associated with the 
request, and reports the transaction as complete. If a NACK 
response passes these tests, the end node reports the failure 
of the transaction to the session layer. If a valid ACK/NACK 
response is not received within the allotted time limit, a 
time-out error is reported. 

The requestor can stream many strictly ordered ServerNet 
II [illegible] onto the wire before receiving an acknowledgment. 
[illegible] 
have up to 128 packets outstanding per VI. 

The hardware can operate in one of two modes with 
respect to generating multiple outstanding request packets: 

1. The hardware can stream packets from the same VI 
send queue onto the wire, and start the next descriptor before 
receiving all the acknowledgments from the current descriptor. 
This is referred to as "Next Descriptor After Launch" or 
NDAL. 

2. The hardware can stream packets to a single descriptor 
onto the wire, but wait for all the outstanding acknowledgments 
to complete before starting the next descriptor. This is 
referred to as "Next Descriptor after ACK" or NDAA. 

The choice of NDAL or NDAA modes of operation is 
determined by how strongly ordered the packets are gener- 
ated. 

Ordered and unordered messages may be mixed on a 
single VI. When generating an unordered message, the 
requester must wait for completion of all acknowledgments 
to unordered packets before starting the next descriptor. 

Ordering of Send Packets Presented to Transaction 
Layer 

The VI architecture has no explicit ordering rules as to 
how the packets that make up a single descriptor are ordered 
among themselves. That is, VIA only guarantees the message 
ordering the client will see. For example, VIA requires 
that Send descriptors for a particular VI be completed in 
order, but the VIA specification does not say how the packets 
will proceed on the wire. 

The ServerNet II SAN requires that all Send packets 
destined for a particular VI be delivered by the SAN in strict 
order. As long as deterministic routing is used, the network 
assures strict ordering along a path from a particular source 
node to a particular destination node. This is necessary 
because the receiving node places the incoming packets into 
a scatter list. Each incoming packet goes to a destination 
determined by the sum total of bytes of the previous packets. 
The strict ordering of packets is necessary to preserve 
integrity of the entire block of data being transferred because 
incoming packets are placed in consecutive locations within 
the block of data. Each packet has a sequence number to 
allow the receiver to detect an out of order, missing, or 
repeated packet. 
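
The placement rule above (each packet's destination is determined by the total bytes of the packets before it) can be sketched as follows. The function names and the representation of a scatter list as a list of segment lengths are assumptions made for illustration.

```python
from typing import List, Tuple


def packet_offsets(packet_lengths: List[int]) -> List[int]:
    """Return the message byte offset at which each in-order packet begins."""
    offsets = []
    total = 0
    for length in packet_lengths:
        offsets.append(total)  # destination depends only on earlier packets
        total += length
    return offsets


def locate(offset: int, segment_lengths: List[int]) -> Tuple[int, int]:
    """Map a message byte offset to (segment index, offset within segment)."""
    for index, seg_len in enumerate(segment_lengths):
        if offset < seg_len:
            return index, offset
        offset -= seg_len
    raise ValueError("offset lies beyond the scatter list")
```

Because each offset depends on every earlier packet, a lost, repeated, or reordered packet would shift all later data. This is why strict ordering and a sequence number check are needed for Send packets.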

There are two ways for an end node to meet these ordering 
requirements: 

a. The end node can wait for the acknowledgment from 
each Send packet to complete before starting another Send 
packet for that VI. By waiting for each acknowledgment, the 
end node does not have to worry about the network provid- 
ing strict ordering and can choose an arbitrary source port, 
adaptive link set and destination port for each message. 

b. The end node can restrict all the Send operations for a 
given VI to use the same source port, the same destination 
port, and a single adaptive path. By choosing only one path 
through the network, the end node is guaranteed that each 
Send packet it launches into the network will arrive at the 
destination in order. 

The second approach requires the VIA end node to 
maintain state per VI that indicates which source port, 
destination port and adaptive path is currently in use for that 
particular VI. Furthermore, the second approach allows the 
hardware to process descriptors in the higher performance 
NDAL mode. 

With the second approach, Send packets from a single VI 
can stream onto the network without waiting for their 
accompanying acknowledgments. An incrementing 
sequence number is used so the destination node can detect 
missing, repeated, or unordered Send packets. 

Ordering of RDMA Packets 

RDMA operations have slightly different ordering 
requirements than Send operations. An RDMA packet con- 
tains the address to which the destination end node writes the 
packet contents. This allows multiple RDMA packets within 
an RDMA message to complete out of order. The contents of 
each packet are written to the correct place in the end node's 
memory, regardless of the order in which they complete. 
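
A minimal sketch of why RDMA packets can complete in any order: each packet carries its own destination address, so the final memory contents do not depend on arrival order. The `(address, payload)` tuple and the function name are illustrative assumptions, not the packet format of the specification.

```python
from typing import Iterable, Tuple


def apply_rdma_writes(memory: bytearray,
                      packets: Iterable[Tuple[int, bytes]]) -> None:
    """Write each packet's payload at the address the packet carries."""
    for address, payload in packets:
        end = address + len(payload)
        if address < 0 or end > len(memory):
            raise ValueError("packet targets memory outside the buffer")
        # Equal-length slice assignment overwrites in place.
        memory[address:end] = payload


in_order = bytearray(8)
apply_rdma_writes(in_order, [(0, b"abcd"), (4, b"efgh")])

reordered = bytearray(8)
apply_rdma_writes(reordered, [(4, b"efgh"), (0, b"abcd")])
```

Both buffers end up holding the same eight bytes, which is the property that allows the packets of one RDMA message to travel on different paths.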

RDMA request packets may be sent ordered or unordered. 
A bit in the packet header is set to 1 for ordered packets and 
is set to 0 for unordered packets. As will be explained later, 
this bit is used by the responder logic to determine if it 
should increment its copy of the expected sequence number. 
Sequence numbers do not increment for unordered packets. 
The end node is free to use different source ports, destination 
ports and adaptive paths for the packets. This freedom can 
be exploited for a performance gain through multipathing: 
simultaneously sending the RDMA packets of a single 
message across multiple paths. 
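
The requestor-side rule (the sequence number advances only for ordered packets) can be sketched as follows. The class name is hypothetical, the 8-bit width follows the per-VI sequence number described elsewhere in this specification, and the five unordered packets in the usage lines are an assumed split of a 2500-byte RDMA into payloads of at most 512 bytes.

```python
SEQ_BITS = 8                  # per-VI sequence number width
SEQ_MODULUS = 1 << SEQ_BITS   # sequence numbers wrap at 256


class RequesterSequence:
    """Requestor's local copy of the sequence number for one VI."""

    def __init__(self) -> None:
        self.value = 0  # both ends start from a common value (zero)

    def tag(self, ordered: bool) -> int:
        """Return the SEQ to place in the next request packet."""
        seq = self.value
        if ordered:
            # Only ordered packets advance the sequence number.
            self.value = (self.value + 1) % SEQ_MODULUS
        return seq


seq = RequesterSequence()
send_tags = [seq.tag(ordered=True) for _ in range(6)]    # Send packets
rdma_tags = [seq.tag(ordered=False) for _ in range(5)]   # unordered RDMA
next_send_tag = seq.tag(ordered=True)                    # following Send
```

Under these assumptions the Send packets carry SEQ 0 through 5, every unordered RDMA packet carries SEQ 6, and the first packet of the following Send also carries SEQ 6 before the counter advances to 7.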

When RDMA Read or Write packets are sent over a path 
that does not exhibit strict ordering with the Send packets 
from the same VI, care must be taken when launching 
packets for the following message. The next message cannot 
be started until the last acknowledgment of the RDMA Read 
or Write operation successfully completes. 

In other words, when multipathing is used to generate 
RDMA Read or Write requests, the hardware must operate 
in the NDAA mode. This ensures the RDMA Read or Write 
is completed before moving on to subsequent descriptors. 

An end node may choose to send RDMA packets strictly 
ordered. This can be advantageous for smaller RDMA 
transfers as the hardware can operate in NDAL mode. The 
VI can proceed to the next descriptor immediately after 
launching the last packet of a message that is sent strictly 
ordered (and hence used incrementing sequence numbers). 
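
The relationship between ordering and the two descriptor modes can be sketched as follows. The enum, function names, and parameters are illustrative assumptions; the specification describes hardware behavior, not a software interface.

```python
from enum import Enum


class DescriptorMode(Enum):
    NDAL = "Next Descriptor After Launch"
    NDAA = "Next Descriptor After ACK"


def select_mode(strictly_ordered: bool) -> DescriptorMode:
    """Strictly ordered messages may use NDAL; unordered ones need NDAA."""
    return DescriptorMode.NDAL if strictly_ordered else DescriptorMode.NDAA


def may_start_next_descriptor(mode: DescriptorMode,
                              all_packets_launched: bool,
                              outstanding_acks: int) -> bool:
    """Decide whether the VI can move on to its next descriptor."""
    if not all_packets_launched:
        return False
    if mode is DescriptorMode.NDAL:
        return True                 # proceed right after the last launch
    return outstanding_acks == 0    # NDAA waits for every acknowledgment
```

A multipathed RDMA therefore holds the VI until its last acknowledgment arrives, while a strictly ordered transfer lets the next descriptor start immediately.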

Ordering of Generated Response Packets at the 
Responder 

The ServerNet II end node must respond to incoming 
Send requests and RDMA Write requests from a particular 
VI in strict order, and must write these packets to memory 
in strict order. 

The ServerNet II end node must also respond to incoming 
RDMA Read requests from a particular VI in strict order. 

Because response packets are transported by the network 
in strict order, the requestor will receive all incoming 
response packets for a particular VI in the same order as that 
in which the corresponding requests were generated. 

VIA Message Sequence Numbers 

The ServerNet SAN uses acknowledgment packets to 
inform the requestor that a packet completed successfully. 
Sequence numbers in the packets (and acknowledgments) 
are used to allow the sender to support multiple outstanding 
requests to ensure adequate performance and to be able to 
recover from errors occurring in the network. 

FIG. 6 is a graph depicting the generation, checking, and 
updating of VIA sequence numbers at requestor and 
responder nodes. In FIG. 6, time increases in the downward 
direction. Requests are indicated by solid arrows directed to 
the right and responses by dotted arrows directed to the left. 

Sequence Number Initialization 

The requestor and responder logic each maintain an 8 bit 
sequence number for each VI in use. When the VI is 
created, the requester on one node and the responder on the 
remote node initialize their sequence numbers to a common 
value, zero in the preferred embodiment. 

After this, the requester places its sequence number into 
each of the outgoing request packets. As depicted in FIG. 6, 
the sequence number, SEQ, is included in each request 
packet. The responder compares the sequence number from 
the incoming request packet with the responder's local copy. 
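
One way the responder's comparison could be realized is sketched below. The modular half-range test is an assumption made for illustration: it is one interpretation consistent with an 8-bit sequence number and the stated window size of 2**(S-1) outstanding packets, and it is not a method given in the specification. The function name and the result strings are hypothetical.

```python
SEQ_BITS = 8
SEQ_MODULUS = 1 << SEQ_BITS      # 256 possible sequence numbers
WINDOW = 1 << (SEQ_BITS - 1)     # 2**(S-1) = 128 outstanding packets


def classify(expected: int, packet_seq: int) -> str:
    """Classify an incoming request against the responder's local copy."""
    diff = (packet_seq - expected) % SEQ_MODULUS
    if diff == 0:
        return "expected"          # process the packet and ACK it
    if diff < WINDOW:
        return "out_of_sequence"   # an earlier packet was lost
    return "duplicate"             # resent packet: ACK it and discard it
```

Limiting the window to half of the sequence space is what makes this test unambiguous: a retransmitted packet is at most 128 behind the expected value, and a packet that overtook a lost one is at most 127 ahead.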
The responder uses this comparison to determine if the 
packet is valid, if it is a duplicate of a packet already 
received, or if it is an out-of-sequence packet. An 
out-of-sequence packet can only happen if the responder missed an 
incoming packet. The responder can choose to return a 
'sequence error NACK packet' or it can simply ignore the 
out-of-sequence packet. In the latter case, the requester will 
have a time-out on the request (and presumably on the 
packet the responder missed) and initiate error recovery. 
Generating a sequence error NACK packet is preferred as it 
forces the requester to start error recovery more quickly. 
The following describes how the sequence numbers are 
generated and checked. 

Generating Sequence Numbers for Request Packets 

When transmitting ordered packets (i.e. transfers are on a 
specific source port to a specific destination port and the 
ACB specifies a specific lane) the request sequence number 
is incremented after each packet is sent. When transmitting 
unordered packets (i.e. multipathing is used and/or the ACB 
bits specify full link set adaptivity) the request sequence 
number is not incremented after such a packet is sent. 

For example, in FIG. 6, during the first two Send 
transactions, the local copy of the request sequence number 
is incremented after the packet is sent (Rqst. SN=0 to 6). For 
the RDMA operation, which sends 2500 bytes unordered, 
the requester does not increment the local copy of the request 
sequence number (Rqst. SN=6). The requester does not 
increment the local copy of the SN until after the first packet 
of the Send following the RDMA is transmitted. 

Send packets are typically sent fully ordered lest the 
requestor have to wait for an acknowledgment for each 
packet before proceeding to the next. On the other hand, 
RDMA packets may be sent either ordered or unordered. To 
take advantage of multipathing, a requestor must use 
unordered RDMA packets. 

The sender guarantees to never exceed the window size 
number of packets outstanding per VI. If S is the number of 
bits in the sequence number, then the window size is 
2**(S-1). 

A packet is outstanding until it and all its predecessors are 
acknowledged. The requestor does not mark a descriptor 
done until all packets requested by that descriptor are 
positively acknowledged. 

Checking Sequence Numbers on Incoming Request 
Packets 

The destination node responding to the incoming request 
packet checks each incoming request packet to verify its 
sequence number against the responder's local copy. 

The responder logic compares its sequence number with 
the packet's sequence number to determine if the incoming 
packet is either: 

the expected packet it is looking for (i.e., the packet's 
sequence number is the same as the sequence number 
maintained by the responder logic), in which case the 
responder processes the packet and, if all other checks 
are passed, the packet is acknowledged and committed 
to memory. If the transaction is ordered, the responder 
increments its sequence number. If the transaction is 
unordered, the responder does not increment its 
sequence number; 

an out-of-sequence packet (which means an earlier 
incoming packet must have gotten lost), beyond the one 
it is looking for, in which case the responder NACKs the 
packet and throws it away. The receive logic in the VI 
is not stopped and the responder does not increment its 
sequence number; or 

a duplicate packet (which is being resent because the 
requestor must not have received an earlier ACK), in 
which case the responder ACKs the packet and throws 
it away. If the request had been an RDMA Read, the 
responder completes the read operation and returns the 
data with a positive acknowledgment. 

An example of the responder checking sequence numbers 
for ordered and unordered packets is given in FIG. 6. In FIG. 
6, during the first two Send transactions, the responder 
checks that the SEQ in the packet matches the local copy of 
Rsp. SN. Since the Send packets include an ACB indicating 
ordered packets, the Rsp. SN is incremented after each 
response packet is transmitted. At the end of the first two 
Send transactions, Rqst. SN and Rsp. SN both equal 6. The 
packets for the RDMA include an ACB indicating unordered 
receipt is allowed. Neither the requestor nor the responder 
increments its local copy of SN, so that at the end of the RDMA 
transaction both Rqst. SN and Rsp. SN=6. The first packet 
of the subsequent Send transaction has SEQ=6 and SEQ 
matches the local copy of Rsp. SN. Since Send packets are 
ordered, the responder increments its local copy of Rsp. SN. 

Sequence Numbers on Response Packets 

When generating either a positive or negative 
acknowledgment, the responder logic copies the incoming 
sequence number and uses it in the sequence number field of 
the acknowledgment. 

The requestor logic matches incoming responses with the 
originating request by comparing the SourceID, VI number, 
Sequence number, transaction type, and Transaction Serial 
Number (TSN) with that of the originating request. 

Error Recovery and Path State 

Error recovery is initiated by the requesting node whenever 
the requestor fails to get a positive acknowledgment for 
each of its request packets. A time-out, or a NACK indicating 
a sequence number error, can cause the requestor's Kernel 
Agent to start error recovery. 

Error recovery involves three basic steps: 

1) Completing a barrier operation(s) to flush out any 
errant request or response packets. 
2) Disabling a bad path if the barrier operation failed. 
3) Retransmitting from the earliest packet that had failed. 

The first two steps will now be described with reference 
to FIG. 7, which is two interlocked state diagrams showing 
the state that software on the requestor and responder moves 
through for each path. In FIG. 7, dashed lines represent 
Kernel to Kernel Supervisory Protocol messages that modify 
the remote node's state. 

The ServerNet architecture allows multiple paths between 
end nodes. The requestor repeats these two basic steps on 
each path until the packet is transmitted successfully. 

The requestor and responder SW each maintain a view of 
the state of each path. The requestor uses its view of the path 
state to determine which path it uses for Send and RDMA 
operations. The responder uses its view of the path state to 
determine which input paths it allows incoming requests on. 
The responder logic maintains a four bit field (ReqInPath 
Vector) for each VI in use. Each of the four bits corresponds 
to one of the four possible paths between the requestor's two 
ports and the responder's two ports. The requestor only 
accepts incoming requests from a particular source or
destination port if the corresponding bit in the
ReqInPathVector is set.

The requestor and responder communicate using the
kernel-to-kernel Supervisory Protocol to communicate path
state changes.

The requestor's view of the path state transitions from
good to bad whenever the requestor fails to get an
acknowledgment (either positive or negative) to a request. The
requestor detects the lack of an ACK or NACK by getting a
time-out error. The requestor can attempt a barrier operation
on the path to see if the failure is permanent or transient. If
the barrier succeeds, the path is considered good and the
original operation can be retried. If the barrier fails, the
requestor must resort to a different good path.

Before the requestor can try a different good path, the
requestor must inform the destination that the original path
is bad. This is done by any path possible. For example, in
VIA the Kernel Agent to Kernel Agent Supervisory Protocol
is used. After the destination is informed the path is bad, the
destination disables a bit in a four bit field
(ReqInPathVector), thereby ignoring incoming requests from
that path. The requestor then stops using the bad path until a
subsequent barrier transaction determines that the path is good.
After the destination acknowledges the supervisory protocol
message, indicating that the destination has disabled
requests from the offending path, the requestor is free to
retry the message on a different path.

After a time-out error, the requestor attempts to bring the
path back to a useful state by completing a barrier operation.
The barrier operation ensures there are no other packets in
any buffer that might show up later and corrupt the data
transfer.

Barrier operations are used in error recovery to flush any
stale request or stale response packets from a particular path
in the SAN. A path is the collection of ServerNet links
between a specific port of two end nodes.

A VIA barrier operation is done with an RDMA Write
followed by an RDMA Read. A number chosen from a large
number space (either incrementing or pseudo random) is
written to a fixed location (e.g. a page number agreed to, a
priori, by the kernel agent-to-kernel agent Supervisory
Protocol and either a fixed or random offset within the page).
The number is then read back with an RDMA Read. If the
read value matches the write value, then the barrier
succeeded and there are guaranteed to be no more Send or
RDMA request or response packets on that path between the
requestor and responder.

If the RDMA operation fails because the number read
back does not match the number written, then the barrier is
tried again. This could have happened because a previous
response in the network came back and fulfilled the barrier.

Note that the barrier needs to be done separately on all
paths the RDMA operation could have taken. That is, if the
RDMA operation was being generated from multiple source
ports (multipathing) and was using full link adaptivity (the
packets were allowed to take any one of four possible
"lanes"), then separate barrier operations must be done from
each source port to each destination port, over each of the
possible link adaptive paths.

The barrier operation must be done for each of the
possible "lanes" between a specific Source port and
Destination port. A barrier done on one VI ensures that all other
VIs using that source port and destination port have no
remaining request or response packets lurking in the SAN.
If a path traverses a "fat-pipe" a separate barrier must be
sent down each lane of the fat pipe. SW can either blindly
send four barrier operations (one for each lane) or it can
maintain state, telling how many lanes are in use on the fat
pipes between any given source/destination pair. The barrier
need only be sent along the same path as the original request
that failed. There is no requirement for the barrier to be sent
from or to the same VI.

Turning now to the third step, i.e., retransmitting from the
earliest packet that failed to receive a positive
acknowledgement. After notification of the error, the
requestor retransmits the packets starting at (or before) the
packet that failed to receive a positive acknowledgement.
The requestor can restart up to windowSize number of
packets. The responder logic acknowledges and then ignores
any packets that are resent if they have already been stored
in the receive queue. When the [illegible] packet is reached, the
responder logic can tell from the sequence number that it is
now [illegible] to resume the data to the receive queue.

Examples of retransmission after failure to receive a
response are depicted in FIGS. 8 and 9. In FIG. 8, the request
packet with SEQ=2 is corrupted. The missing request is
detected by the responder on the next packet and NACKed
(Negative Acknowledged) and all subsequent packets are
thrown away and NACKed. The requestor resets its send
engine to start generating packets at the one that failed to
receive an ACK (in this case Rqst. SN=2). The responder
recognizes the SEQ=2 and accepts the packets.

In FIG. 9, the response packet with SEQ=1 is corrupted.
The missing response is detected when the requestor times
out its transaction. The requestor resets its send engine to
start generating packets at the one that failed to receive an
ACK (in this case Rqst. SN=1). The responder recognizes the
resent packets as already having been received,
acknowledges them and discards the data.

Note that the response packets for this particular RDMA
transaction all have the same value of SEQ because the
request SNs and response SNs are not incremented for
RDMA transactions that are unordered. In this case, the
TSNs are utilized by the requestor to match response packets
to outstanding requests.

Error recovery places several requirements on the
requestor's KA (Kernel Agent, the kernel mode driver code
responsible for SAN error recovery):

1) The KA must determine the sequence number to restart
with.

2) The KA must determine the proper data contents of the
packet to be resent.

3) In order for the KA to determine the appropriate
sequence number, it must be aware of how the hardware
packetizes data under any given combination of descriptors,
data segments, page crossings etc.

Note the responder side does not require KA involvement
(unless a barrier operation fails).

The invention has now been described with reference to
the preferred embodiments. Alternatives and substitutions
will now be apparent to persons of skill in the art. For
example, while the invention has been described in the
context of the ServerNet II SAN, the principles of the
invention are useful in any network that utilizes multiple
paths between end nodes. Accordingly, it is not intended to
limit the invention except as provided by the appended
claims.

What is claimed is:

1. A method for error detection and recovery in a system
with a plurality of networked nodes, including source and
destination, communicating with each other via paths,
comprising:

creating a request packet for a request transaction;

routing the request packet from the source to the desti- 
nation via a particular path; 

maintaining at each of the source and destination a status
for each of its respective paths;

detecting a time-out error for failure within a predeter-
mined time limit to receive at the source an acknowl-
edge (ACK) packet or a negative-acknowledge
(NACK) packet in response to the request packet, the
time-out error created from a failure of the particular
path; 

performing a barrier transaction via the particular path to 
determine if the failure of the particular path is transient 
or permanent; 

periodically repeating the barrier transaction via the par-
ticular path in order to determine if its failure is cured;

re-transmitting the request packet via the particular path if 
the failure is transient; and 

if the failure is permanent, 
updating at the source the status for the particular path,
and

routing to the destination information about the
updated status via an alternate path to prompt updat-
ing at the destination of the status for the particular
path, wherein a failed path is not used.

2. The method of claim 1, wherein when the transaction
is ordered the routing is deterministic preserving strict 
ordering of request, ACK and NACK packets by selecting a 
single path as the particular path. 

3. A method as in claim 1, wherein when the transaction
is not ordered the routing is link set adaptive enabling
dynamic change of request, ACK and NACK packet routing 
by selection of paths from a link set. 

4. A method as in claim 1, wherein the request packet is 
discarded if it is invalid. 

5. A method as in claim 1, wherein the request packet
contains a sequence number to indicate its place in a 
sequence of request packets and adaptive control bits to 
indicate the routing as ordered or not ordered. 

6. A method as in claim 1, further comprising: 
maintaining a sequence number in each of the source and
destination, the sequence numbers being initialized to a
common value; and

including the sequence number in the request packet,
wherein the sequence number is tracked by the source
and destination, and wherein the NACK packet is
created if a mismatch is found between the sequence 
number in the request packet and the sequence number 
at the destination. 

7. A method for error detection and recovery in a system 
with a plurality of networked nodes, including source and 
destination, communicating with each other via paths, com- 
prising: 

maintaining a sequence number in each of the source and 
destination, the sequence numbers being initialized to a 
common value;

creating a request packet for a request transaction, the 
request packet containing the sequence number; 

routing the request packet from the source to the desti- 
nation via a particular path, wherein if the request 
transaction is ordered, the sequence number at the
source is incremented; 

maintaining at each of the source and destination a status 
for each of its respective paths; 

checking the request packet for integrity and, if the 
request packet is valid, matching the sequence number
in the packet with the sequence number at the destina- 
tion; 




creating an acknowledge (ACK) packet for a response 
transaction if the request packet is valid and the 
sequence number matching succeeds, and routing the 
ACK packet to the source, the ACK packet containing 
the sequence number from the request packet; 

creating a negative acknowledge (NACK) packet for the 
response transaction if the sequence number matching 
fails, and routing the NACK packet to the source, the 
NACK packet containing the sequence number from 
the request packet; 

incrementing the sequence number at the destination if the 
request transaction is ordered and the sequence number 
matching succeeds; 

detecting a time-out error for failure within a predeter- 
mined time limit to receive at the source the ACK or the 
NACK packets in response to the request packet, the 
time-out error created from a failure of the particular 
path; 

performing a barrier transaction via the particular path to 
determine if the failure of the particular path is transient 
or permanent; 

periodically repeating the barrier transaction via the par- 
ticular path in order to determine if its failure is cured; 

re-transmitting the request packet via the particular path if 
the failure is transient; and 

if the failure is permanent, updating a status at the source 
for the particular path and routing to the destination 
information about the updated status via an alternate 
path to prompt updating of the status for the particular 
path at the destination, wherein a failed path is not used. 

8. A method as in claim 7, wherein when the transaction 
is ordered the routing is deterministic preserving strict 
ordering of packets by selecting a single path as the par- 
ticular path. 

9. A method as in claim 7, wherein when the transaction 
is not ordered the routing is link set adaptive enabling 
dynamic change of packet routing by selection of paths from 
a path set. 

10. A method as in claim 7, wherein the request packet is 
discarded if it is invalid. 

11. A method as in claim 7, wherein the request packet 
contains adaptive control bits indicating whether the trans- 
action is ordered or not ordered. 

12. A system for error detection and recovery with a 
plurality of networked nodes, including source and 
destination, communicating with each other via paths, com- 
prising: 

means for creating a request packet for a request trans-
action;

path means for routing the request packet from
the source to the destination via a particular path;

means for maintaining at each of the source and destina- 
tion a status for each of its respective paths; 

means for detecting a time-out error for failure within a 
predetermined time limit to receive at the source an 
acknowledge (ACK) packet or a negative-acknowledge 
(NACK) packet in response to the request packet, the 
time-out error created from a failure of the particular 
path; 

means for performing a barrier transaction via the par- 
ticular path to determine if the failure of the particular 
path is transient or permanent, including means for 
periodically repeating the barrier transaction via the 
particular path in order to determine if its failure is 
cured; 

means for re-transmitting from the source the request 
packet via the particular path if the failure is transient; 
and 
if the failure is permanent,

means for updating at the source the status for the 
particular path, and 

means for routing to the destination information about 
the updated status via an alternate path to prompt
updating at the destination of the status for the 
particular path, wherein a failed path is not used. 

13. The system of claim 12, wherein when the transaction 
is ordered the routing is deterministic preserving strict 
ordering of request, ACK and NACK packets by selecting a
single path as the particular path. 

14. A system as in claim 12, wherein when the transaction 
is not ordered the routing is link set adaptive enabling 
dynamic change of request, ACK and NACK packet routing 
by selection of paths from a link set.

15. A system as in claim 12, further comprising: 
means for discarding the request packet if it is invalid. 

16. A system as in claim 12, wherein the request packet 
contains a sequence number to indicate its place in a 
sequence of request packets and adaptive control bits to
indicate the routing as ordered or not ordered. 

17. A system as in claim 12, further comprising: 
means for maintaining a sequence number in each of the
source and destination, the sequence numbers being
initialized to a common value; and

means for including the sequence number in the request
packet, wherein the sequence number is tracked by the
source and destination, and wherein the NACK packet
is created if a mismatch is found between the sequence
number in the request packet and the sequence number 
at the destination. 

18. A system for error detection and recovery in a system 
with a plurality of networked nodes, including source and 
destination, communicating with each other via paths, com-
prising:

means for maintaining a sequence number in each of the
source and destination, the sequence numbers being
initialized to a common value;

means for creating a request packet for a request
transaction, the request packet containing the sequence
number;

means for routing the request packet from the source to 
the destination via a particular path, including means 
for incrementing the sequence number at the source if
the request transaction is ordered; 

means for maintaining at each of the source and destina- 
tion a status for each of its respective paths; 

means for checking the request packet for integrity, 
including means for matching the sequence number in 
the packet against the sequence number at the destina-
tion if the request packet is valid;

means for creating an acknowledge (ACK) packet for a 
response transaction if the request packet is valid and a 
sequence number matching succeeds, including means 
for routing the ACK packet to the source, the ACK 
packet containing the sequence number from the 
request packet; 

means for creating a negative acknowledge (NACK) 
packet for the response transaction if the sequence 
number matching fails, including means for routing the 
NACK packet to the source, the NACK packet con- 
taining the sequence number from the request packet; 

means for incrementing the sequence number if the 
request transaction is ordered and the sequence number 
matching succeeds; 

means for detecting a time-out error for failure within a 
predetermined time limit to receive at the source the 
ACK or the NACK packets in response to the request 
packet, the time-out error created from a failure of the 
particular path; 

means for performing a barrier transaction via the par- 
ticular path to determine if the failure of the particular 
path is transient or permanent, including by periodi- 
cally repeating the barrier transaction via the particular 
path in order to determine if its failure is cured; 

means for re-transmitting the request packet via the 
particular path if the failure is transient; and 

if the failure is permanent, 

means for updating a status at the source for the
particular path, and

means for routing to the destination information about
the updated status via an alternate path to prompt 
updating of the status for the particular path at the 
destination, wherein a failed path is not used. 

19. A system as in claim 18, wherein when the transaction 
is ordered the routing is deterministic preserving strict 
ordering of request, ACK and NACK packets by selecting a 
single path as the particular path. 

20. A system as in claim 18, wherein when the transaction 
is not ordered the routing is link set adaptive enabling 
dynamic change of request, ACK and NACK packet routing 
by selection of paths from a path set. 

21. A system as in claim 18, further comprising: 
means for discarding the request packet if it is invalid. 

22. A system as in claim 18, wherein the request packet 
contains adaptive control bits indicating whether the trans- 
action is ordered or not ordered. 
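
By way of illustration only, and forming no part of the claims, the
sequence number check at the destination and the barrier-based path
recovery at the source may be sketched in Python as follows. The
sketch rests on several assumptions that do not come from the text
above: a 4-bit sequence number, the names Responder and recover, and
the three callbacks barrier, notify_destination and resend, which
stand in for the barrier transaction, the supervisory protocol message
and the retransmission respectively. The treatment of a resent packet
as "behind" the expected sequence number by at most the window size is
one plausible reading of the 2**(S-1) window, not a statement of how
the hardware decides.

```python
SN_BITS = 4                      # assumed sequence-number width S
WINDOW = 2 ** (SN_BITS - 1)      # window size 2**(S-1)
MOD = 2 ** SN_BITS               # sequence numbers wrap modulo 2**S


class Responder:
    """Destination-side sequence-number check for one VI (illustrative)."""

    def __init__(self):
        self.expected_sn = 0     # source and destination start at a common value
        self.stored = []         # stands in for the receive queue

    def on_request(self, sn, payload, ordered=True):
        """Return (reply, sn); the reply echoes the request's sequence number."""
        behind = (self.expected_sn - sn) % MOD
        if behind == 0:
            # Expected packet: store it, and advance only for ordered requests.
            self.stored.append(payload)
            if ordered:
                self.expected_sn = (self.expected_sn + 1) % MOD
            return ("ACK", sn)
        if behind <= WINDOW:
            # Resent packet that was already stored: acknowledge, discard data.
            return ("ACK", sn)
        # Packet is ahead of the expected one, so an earlier packet was lost.
        return ("NACK", sn)


def recover(paths, path_good, barrier, notify_destination, resend):
    """Source-side recovery after a time-out; returns the path used or None."""
    for path in paths:
        if not path_good[path]:
            continue
        if barrier(path):
            # Barrier succeeded: the failure was transient, so retransmit here.
            resend(path)
            return path
        # Barrier failed: mark the path bad at the source and inform the
        # destination (over some other path) before trying the next path.
        path_good[path] = False
        notify_destination(path)
    return None
```

In this sketch a request that arrives ahead of the expected sequence
number is NACKed, a resent request that was already stored is
acknowledged and its data discarded, and a path whose barrier fails is
not used again until its status is changed back by the caller.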

* * * * *